In [ ]:
%matplotlib inline

import time
import calendar
import codecs
import datetime
import sys
import gzip
import string
import glob
import os

# For parsing JSON
import json

Topic 3. JSON - JavaScript Object Notation

Much of the data we will work with comes in the JavaScript Object Notation (JSON) format. JSON is a lightweight text format that describes objects as keys and values without requiring a schema up front (XML, by comparison, is often paired with one).

Many "RESTful" APIs available on the web today return data in JSON format, and the data we have stored from Twitter is in JSON as well.

Python's JSON support is robust and ships with the standard library as the json module. This module lets us read and write JSON to/from a string or file and convert many of Python's built-in types into a text format.
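As a minimal sketch of the string and file interfaces described above, the cell below round-trips a small dictionary both ways. The `record` data is made up for illustration, and `io.StringIO` is only an in-memory stand-in for a real file.

```python
import io
import json

# Made-up record mixing several Python types
record = {"name": "Cody", "scores": [1, 2.5], "active": True, "advisor": None}

# String round trip: dumps() serializes, loads() parses
as_text = json.dumps(record)
from_text = json.loads(as_text)

# File round trip: dump() writes to a file object, load() reads from one.
# io.StringIO stands in for a real file here.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)
from_file = json.load(buffer)

print(as_text)
print("Round trips match:", from_text == record, from_file == record)
```

Note how Python's `True` and `None` are written as JSON's `true` and `null` in the text form.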

JSON and Keys/Values

The main idea here is that JSON lets us specify a key, or name, for some data and then that data's value, which can be a string, number, boolean, null, array, or another object.

An example line of JSON might look like:

{"key": "value"}


In [ ]:
jsonString = '{"key": "value"}'

# Parse the JSON string
dictFromJson = json.loads(jsonString)

# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)

# Will print the value
print ("Data stored in \"key\":\n", dictFromJson["key"])

# This will raise a KeyError: "value" is the stored data, not a key
print ("Data stored in \"value\":\n", dictFromJson["value"])
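The failed lookup above can be avoided in a few ways. The following is a small sketch of three common options, not the only way to handle it; the placeholder text `<not present>` is an arbitrary choice for the example.

```python
import json

parsed = json.loads('{"key": "value"}')

# Option 1: test for the key before indexing
if "value" in parsed:
    print(parsed["value"])
else:
    print('No key named "value"')

# Option 2: dict.get() returns a default instead of raising
missing = parsed.get("value", "<not present>")
print(missing)

# Option 3: catch the KeyError explicitly
try:
    parsed["value"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("Lookup failed:", lookup_failed)
```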

Multiple Keys and Values

A JSON string/file can have many keys and values, but every key must have a value. Values can appear without keys only inside arrays, which we describe later.

An example of JSON string with multiple keys is below:

{ "name": "Cody", "occupation": "Student", "goal": "PhD" }

Note the commas after the first two key/value pairs. JSON requires a comma between consecutive pairs and does not allow one after the last pair.


In [ ]:
jsonString = '{ "name": "Cody", "occupation": "PostDoc", "goal": "Tenure" }'

# Parse the JSON string
dictFromJson = json.loads(jsonString)

# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)
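To see why the commas matter, here is a short sketch that checks three strings with Python's parser. The helper `is_valid_json` is a hypothetical name introduced only for this example.

```python
import json

valid = '{ "name": "Cody", "occupation": "Student", "goal": "PhD" }'
trailing_comma = '{ "name": "Cody", "occupation": "Student", "goal": "PhD", }'
missing_comma = '{ "name": "Cody" "occupation": "Student" }'

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Only the first string should parse
results = [is_valid_json(s) for s in (valid, trailing_comma, missing_comma)]
print(results)
```

Both a missing comma between pairs and an extra comma after the last pair make the parser raise `json.JSONDecodeError`.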

JSON and Arrays

The above JSON string describes an object whose name is "Cody". How would we describe a list of similar students? Arrays are useful here and are denoted with "[]" rather than the "{}" object notation. For example:

{ "students": [ { "name": "Cody", "occupation": "Student", "goal": "PhD" }, { "name": "Scott", "occupation": "Student", "goal": "Masters" } ] }

Again, note the comma between the "}" and "{" separating the two student objects and how they are both surrounded by "[]".


In [ ]:
jsonString = '{"students": [{"name": "Cody", "occupation": "PostDoc", "goal": "Tenure"}, {"name": "Scott", "occupation": "Student", "goal": "Masters"}]}'

# Parse the JSON string
dictFromJson = json.loads(jsonString)

# Python now has a dictionary whose "students" key holds a list
print ("Resulting dictionary object:\n", dictFromJson)

print ("Each student:")
for student in dictFromJson["students"]:
    print (student)
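Because the parsed array is an ordinary Python list of dictionaries, the usual list tools apply. The sketch below pulls individual fields out of the students example; the variable names are chosen for illustration.

```python
import json

roster = json.loads(
    '{"students": ['
    '{"name": "Cody", "occupation": "Student", "goal": "PhD"}, '
    '{"name": "Scott", "occupation": "Student", "goal": "Masters"}]}'
)

# The JSON array becomes a Python list
students = roster["students"]
print(type(students).__name__, "with", len(students), "items")

# Collect one field from every student
names = [s["name"] for s in students]
print(names)

# Build a lookup table from name to goal
goal_by_name = {s["name"]: s["goal"] for s in students}
print(goal_by_name)
```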

More JSON + Arrays

A couple of things to note:

  1. JSON does not need a name for the array. It could be declared just as an array.
  2. The student objects need not be identical.

As an example:

[ { "name": "Cody", "occupation": "Student", "goal": "PhD" }, { "name": "Scott", "occupation": "Student", "goal": "Masters", "completed": true } ]


In [ ]:
jsonString = '[{"name": "Cody","occupation": "PostDoc","goal": "Tenure"},{"name": "Scott","occupation": "Student","goal": "Masters","completed": true}]'

# Parse the JSON string
arrFromJson = json.loads(jsonString)

# Python now has a list representing this data
print ("Resulting array:\n", arrFromJson)

print ("Each student:")
for student in arrFromJson:
    print (student)
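When objects in an array differ, code that reads them has to tolerate missing keys. A minimal sketch follows, assuming that a student without a "completed" key should count as not completed (that default is our choice, not something JSON defines).

```python
import json

students = json.loads(
    '[{"name": "Cody", "occupation": "Student", "goal": "PhD"}, '
    '{"name": "Scott", "occupation": "Student", "goal": "Masters", "completed": true}]'
)

# A top-level JSON array parses directly to a list
print(type(students).__name__)

# "completed" exists only on some objects, so use get() with a default.
# JSON true becomes Python True.
completed = {s["name"]: s.get("completed", False) for s in students}
print(completed)
```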

Nested JSON Objects

We've shown you can have an array as a value, and you can do the same with objects. In fact, one of the strengths of JSON is that objects and arrays can nest to arbitrary depth, which makes the format very expressive. Nesting objects within objects is easy, and JSON in the wild relies on this heavily.

An example:

{ "disasters" : [ { "event": "Nepal Earthquake", "date": "25 April 2015", "casualties": 8964, "magnitude": 7.8, "affectedAreas": [ { "country": "Nepal", "capital": "Kathmandu", "population": 26494504 }, { "country": "India", "capital": "New Delhi", "population": 1276267000 }, { "country": "China", "capital": "Beijing", "population": 1376049000 }, { "country": "Bangladesh", "capital": "Dhaka", "population": 168957745 } ] } ] }


In [ ]:
jsonString = '{"disasters" : [{"event": "Nepal Earthquake","date": "25 April 2015","casualties": 8964,"magnitude": 7.8,"affectedAreas": [{"country": "Nepal","capital": "Kathmandu","population": 26494504},{"country": "India","capital": "New Delhi","population": 1276267000},{"country": "China","capital": "Beijing","population": 1376049000},{"country": "Bangladesh","capital": "Dhaka","population": 168957745}]}]}'

disasters = json.loads(jsonString)

for disaster in disasters["disasters"]:
    print (disaster["event"])
    print (disaster["date"])
    
    for country in disaster["affectedAreas"]:
        print (country["country"])
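Nested data can also be reached by chaining keys and list indexes, and numeric values keep their type so we can compute with them. The sketch below uses a trimmed copy of the disaster example with only two affected areas to keep the string short.

```python
import json

doc = json.loads(
    '{"disasters": [{"event": "Nepal Earthquake", "magnitude": 7.8, '
    '"affectedAreas": ['
    '{"country": "Nepal", "capital": "Kathmandu", "population": 26494504}, '
    '{"country": "India", "capital": "New Delhi", "population": 1276267000}]}]}'
)

# Chain a key, a list index, and more keys to reach a nested value
first = doc["disasters"][0]
capital = first["affectedAreas"][1]["capital"]
print("Second affected area's capital:", capital)

# JSON numbers arrive as int or float, so arithmetic works directly
total_population = sum(a["population"] for a in first["affectedAreas"])
print("Combined population:", total_population)

# Find the most populous affected area
largest = max(first["affectedAreas"], key=lambda a: a["population"])["country"]
print("Largest:", largest)
```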

From Python Dictionaries to JSON

We can also go from a Python object to JSON with relative ease.


In [ ]:
exObj = {
    "event": "Nepal Earthquake",
    "date": "25 April 2015",
    "casualties": 8964,
    "magnitude": 7.8
}

print ("Python Object:", exObj, "\n")

# now we can convert to JSON
print ("Object JSON:")
print (json.dumps(exObj), "\n")

# We can also pretty-print the JSON
print ("Readable JSON:")
print (json.dumps(exObj, indent=4)) # Indent adds space
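Not every Python type converts automatically. As a sketch of one common workaround, the cell below stores the date as a `datetime.date` (a change from the string used above, made only for this example) and passes `default=str` so that `json.dumps` falls back to `str()` for types it cannot serialize.

```python
import datetime
import json

event = {
    "event": "Nepal Earthquake",
    "date": datetime.date(2015, 4, 25),
    "magnitude": 7.8,
}

# Without help, json.dumps() rejects the date object
try:
    json.dumps(event)
    raised = False
except TypeError:
    raised = True
print("Plain dumps raised TypeError:", raised)

# default=str converts unsupported values with str();
# sort_keys=True makes the key order predictable
as_json = json.dumps(event, default=str, sort_keys=True)
print(as_json)
```

The date is written as a plain string, so reading it back with `json.loads` gives a `str`, not a date object.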

Reading Twitter JSON

We should now have all the tools necessary to understand how Python can read Twitter JSON data. To show this, we'll read in a single tweet from the Ferguson, MO protests, review its format, and parse it with Python's JSON loader.


In [ ]:
tweetFilename = "first_BlackLivesMatter.json"

# Use Python's os.path.join to account for Windows, OSX/Linux differences
tweetFilePath = os.path.join("..", "00_data", "ferguson", tweetFilename)

print ("Opening", tweetFilePath)

# Open the file as UTF-8 text so non-ASCII characters
# (emoji, accented letters) decode correctly
# Read in the whole file, which contains ONE tweet; the
# with block closes the file for us
with codecs.open(tweetFilePath, "r", "utf8") as tweetFile:
    tweetFileContent = tweetFile.read()

# Print the raw json
print ("Raw Tweet JSON:\n")
print (tweetFileContent)

# Convert the JSON to a Python object
tweet = json.loads(tweetFileContent)
print ("Tweet Object:\n")
print (tweet)

# We could have done this in one step with json.load() 
# called on the open file, but our data files have
# a single tweet JSON per line, so this is more consistent
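For files that hold one tweet per line, the same `json.loads` call can be applied line by line. This is a rough sketch: `fake_file` is a hypothetical in-memory stand-in for such a file, and its two records are made up rather than real tweets.

```python
import io
import json

# Hypothetical stand-in for a file with one JSON object per line
fake_file = io.StringIO(
    '{"id": 1, "text": "first tweet"}\n'
    '\n'
    '{"id": 2, "text": "second tweet"}\n'
)

tweets = []
for line in fake_file:
    line = line.strip()
    # Skip blank lines, which json.loads() would reject
    if not line:
        continue
    tweets.append(json.loads(line))

print("Parsed", len(tweets), "tweets")
for t in tweets:
    print(t["id"], t["text"])
```

With a real file, the loop body stays the same and only the object being iterated changes.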

Twitter JSON Fields

This tweet is pretty big, but we can still see some of the fields it contains. Note it also has many nested fields. We'll go through some of the more important fields below.


In [ ]:
# What fields can we see?
print ("Keys:")
for k in sorted(tweet.keys()):
    print ("\t", k)

print ("Tweet Text:", tweet["text"])
print ("User Name:", tweet["user"]["screen_name"])
print ("Author:", tweet["user"]["name"])
print("Source:", tweet["source"])
print("Retweets:", tweet["retweet_count"])
print("Favorited:", tweet["favorite_count"])
print("Tweet Location:", tweet["place"])
print("Tweet GPS Coordinates:", tweet["coordinates"])
print("Twitter's Guessed Language:", tweet["lang"])

# Tweets have a list of hashtags, mentions, URLs, and other
# attachments in "entities" field
print ("\n", "Entities:")
for eType in tweet["entities"]:
    print ("\t", eType)
    
    for e in tweet["entities"][eType]:
        print ("\t\t", e)
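Individual entity types can be pulled out in the same way. The sketch below assumes the classic Twitter REST API layout, where each hashtag entity carries a "text" field; `sample_tweet` is a hand-written stand-in, not the tweet loaded above.

```python
import json

# Hand-written stand-in with only the fields used below
sample_tweet = json.loads(
    '{"text": "Example #one #two", "coordinates": null, '
    '"entities": {"hashtags": [{"text": "one", "indices": [8, 12]}, '
    '{"text": "two", "indices": [13, 17]}], "urls": [], "user_mentions": []}}'
)

# Collect just the hashtag text, tolerating a missing "hashtags" list
hashtags = [h["text"] for h in sample_tweet["entities"].get("hashtags", [])]
print("Hashtags:", hashtags)

# JSON null becomes None, so test for it before using coordinates
has_gps = sample_tweet.get("coordinates") is not None
print("Has GPS coordinates:", has_gps)
```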
